SwiGLU MLP: parameter-neutral gated activation over LeakyReLU^2#676

Open
they-call-me-god wants to merge 1 commit into openai:main from they-call-me-god:swiglu-submission
Conversation

@they-call-me-god

Summary

Replace LeakyReLU(0.5)² with SwiGLU gating, the same multiplicative activation used in LLaMA, Mistral, Gemma, and PaLM.

Built on the PR #549 SOTA stack (LeakyReLU² + Legal TTT + Parallel Muon). A single change with zero parameter increase.

The Change

# Before (SOTA)
x = F.leaky_relu(F.linear(x, up_w), negative_slope=0.5).square()
out = F.linear(x, down_w)

# After (SwiGLU)
half = up_w.shape[0] // 2
gate = F.silu(F.linear(x, up_w[:half]))   # learned gating
up   = F.linear(x, up_w[half:])
out  = F.linear(gate * up, down_w)
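To make the shape bookkeeping concrete, here is a small numpy stand-in for the snippet above (random weights, `silu` written out by hand; the `F.linear` calls become plain matrix-vector products, and the dimensions 512/2048/1024 come from the table below):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_up = 512, 2048                  # fused gate||up projection width
up_w = rng.standard_normal((d_up, d_model))
down_w = rng.standard_normal((d_model, d_up // 2))

def silu(x):
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_mlp(x):
    half = up_w.shape[0] // 2
    gate = silu(up_w[:half] @ x)           # learned gate, shape (1024,)
    up = up_w[half:] @ x                   # linear branch, shape (1024,)
    return down_w @ (gate * up)            # back to d_model, shape (512,)

out = swiglu_mlp(rng.standard_normal(d_model))
assert out.shape == (d_model,)
```

The input and output widths are unchanged from the old path; only the hidden width shrinks from 1536 to 1024 to pay for the second projection.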

Parameter Neutrality

| Bank | Old shape | New shape |
| --- | --- | --- |
| mlp_up_bank[i] | (1536, 512) | (2048, 512), gate‖up concatenated |
| mlp_down_bank[i] | (512, 1536) | (512, 1024) |

Proof: 2 × 512 × 1536 = 3 × 512 × 1024 = 1,572,864 per layer.
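The proof is plain arithmetic and can be checked directly (no framework needed):

```python
# Old stack: up (1536, 512) + down (512, 1536)
old_params = 1536 * 512 + 512 * 1536
# SwiGLU: fused gate||up (2048, 512) + narrower down (512, 1024)
new_params = 2048 * 512 + 512 * 1024

assert old_params == new_params == 1_572_864
```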

Status

Training logs pending (RunPod 8×H100). Will update with 3-seed results and final val_bpb.

New env vars

  • USE_SWIGLU=1 (default on)
  • SWIGLU_HALF_DIM=1024
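A sketch of how these flags might be read at startup (the variable names are from this PR; the parsing code itself is illustrative, not the repo's actual config logic):

```python
import os

# USE_SWIGLU defaults to on, per the PR description.
use_swiglu = os.environ.get("USE_SWIGLU", "1") == "1"
# Width of each half of the fused gate||up projection.
swiglu_half_dim = int(os.environ.get("SWIGLU_HALF_DIM", "1024"))
```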

Replace LeakyReLU(0.5)^2 with SwiGLU (silu gate * up projection).
Same parameter count: 3*512*1024 = 2*512*1536 = 1,572,864 per layer.
All other SOTA settings preserved (TTT, Parallel Muon, int6+lzma, etc.)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>